Chapter 13 GWR and spatially lagged regression

13.1 Learning objectives

By the end of this practical you should be able to:

  1. Explain hypothesis testing
  2. Execute regression in R
  3. Describe the assumptions associated with regression models
  4. Explain the steps needed to deal with spatially autocorrelated residuals (where the residuals of nearby observations are similar to one another).

13.2 Homework

Outside of our scheduled sessions you should be doing around 12 hours of extra study per week. Feel free to follow your own GIS interests, but good places to start include the following:

Assignment

From weeks 6-9, learn and practice analysis from the course and identify appropriate techniques (from wider research) that might be applicable/relevant to your data. Conduct an extensive methodological review – this could include analysis from within academic literature and/or government departments (or any reputable source).

Reading this week:

Remember this is just a starting point, explore the reading list, practical and lecture for more ideas.

13.4 Introduction

In this practical you will be introduced to a suite of different models that will allow you to test a variety of research questions and hypotheses by modelling the associations between two or more spatially referenced variables.

In the worked example, we will explore the factors that might affect the average exam scores of 16 year-olds across London. GCSEs are the exams taken at the end of secondary education, and here the scores have been aggregated, for all pupils at their home addresses, to Ward geographies across the city.

The London Data Store collates a range of other variables for each Ward and so we will see if any of these are able to help explain the patterns of exam performance that we see.

This practical will walk you through the common steps that you should go through when building a regression model using spatial data to test a stated research hypothesis; from carrying out some descriptive visualisation and summary statistics, to interpreting the results and using the outputs of the model to inform your next steps.

13.4.1 Setting up your Data

First, let’s set up R and read in some data to enable us to carry out our analysis.

First, read some ward data in:
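
One way to do this is sketched below using the `sf` package; the folder and file names match the output that follows, though the original output may have been produced with a different reader (the OGR messages suggest an `rgdal`-era function).

```r
# load the sf package for reading spatial data
library(sf)

# list the folders inside the downloaded boundary data
list.dirs("prac9_data/statistical-gis-boundaries-london")

# read in the City-merged ward boundaries from the ESRI folder
LondonWards <- st_read("prac9_data/statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shp")
```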

## [1] "prac9_data/statistical-gis-boundaries-london"        
## [2] "prac9_data/statistical-gis-boundaries-london/ESRI"   
## [3] "prac9_data/statistical-gis-boundaries-london/MapInfo"
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\Andy\OneDrive - University College London\Teaching\CASA0005repo\prac9_data\statistical-gis-boundaries-london\ESRI\London_Ward_CityMerged.shp", layer: "London_Ward_CityMerged"
## with 625 features
## It has 7 fields
## Integer64 fields read as strings:  POLY_ID
## Simple feature collection with 625 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.1502,0.247,0.8421,-20.4894 +units=m +no_defs
## First 10 features:
##                         NAME  GSS_CODE HECTARES NONLD_AREA LB_GSS_CD
## 0          Chessington South E05000405  755.173          0 E09000021
## 1     Tolworth and Hook Rise E05000414  259.464          0 E09000021
## 2                 Berrylands E05000401  145.390          0 E09000021
## 3                  Alexandra E05000400  268.506          0 E09000021
## 4                   Beverley E05000402  187.821          0 E09000021
## 5                Coombe Hill E05000406  442.170          0 E09000021
## 6 Chessington North and Hook E05000404  192.980          0 E09000021
## 7              Surbiton Hill E05000413  166.482          0 E09000021
## 8                 Old Malden E05000410  180.016          0 E09000021
## 9                 St. Mark's E05000412  137.578          0 E09000021
##                BOROUGH POLY_ID                       geometry
## 0 Kingston upon Thames   50840 POLYGON ((516401.6 160201.8...
## 1 Kingston upon Thames  117160 POLYGON ((517829.6 165447.1...
## 2 Kingston upon Thames   50449 POLYGON ((518107.5 167303.4...
## 3 Kingston upon Thames   50456 POLYGON ((520480 166909.8, ...
## 4 Kingston upon Thames  117161 POLYGON ((522071 168144.9, ...
## 5 Kingston upon Thames  117159 POLYGON ((522007.6 169297.3...
## 6 Kingston upon Thames   50530 POLYGON ((517175.3 164077.3...
## 7 Kingston upon Thames   50457 POLYGON ((517469.3 166878.5...
## 8 Kingston upon Thames   50455 POLYGON ((522231.1 166015, ...
## 9 Kingston upon Thames   50450 POLYGON ((517460.6 167802.9...

Now we are going to read in some data from the London Data Store - https://data.london.gov.uk/

13.4.1.1 Cleaning the data as you read it in

Examining the dataset as it is read in above, you can see that a number of fields that should have been read in as numeric data have actually been read in as character (text) data.

If you examine your data file, you will see why: in a number of columns where data are missing, rather than a blank cell, the value ‘n/a’ has been entered instead. Where these text values appear amongst numbers, the software will assume the whole column is text.

To deal with these errors, we can force read_csv to ignore these values by telling it what values to look out for that indicate missing data.
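
A sketch of that read, using the `na` argument of `read_csv` (the file name is hypothetical — use the CSV you downloaded from the London Data Store):

```r
# load the tidyverse, which includes readr's read_csv
library(tidyverse)

# read the ward attribute data, telling read_csv which values mean 'missing'
LondonWardProfiles <- read_csv("prac9_data/ward-profiles.csv",
                               na = c("", "NA", "n/a"))
```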

Now you have read in both your boundary data and your attribute data, you need to merge the two together using a common ID. In this case, we can use the ward codes to achieve the join.
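
The join might be sketched like this with `left_join` from dplyr; the two ward-code column names are taken from the warning that follows, and the object names are those assumed in the earlier sketches:

```r
library(sf)
library(dplyr)

# merge the boundaries and the attribute data using the ward code columns
LonWardProfiles <- LondonWards %>%
  left_join(LondonWardProfiles,
            by = c("GSS_CODE" = "New code"))
```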

## Warning: Column `GSS_CODE`/`New code` joining factor and character vector,
## coercing into character vector

13.4.1.2 Additional Data

In addition to our main datasets, it might also be useful to add some contextual data. While our exam results have been recorded at the home address of students, most students would have attended one of the schools in the City.

Let’s add some schools data as well.
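
One possible sketch, assuming the schools file contains x and y coordinate columns recorded as longitude/latitude (the file and column names here are hypothetical):

```r
library(tidyverse)
library(sf)

# read the schools data (hypothetical file name)
london_schools <- read_csv("prac9_data/london_schools.csv")

# convert to an sf point object, assuming WGS84 coordinates (EPSG:4326)
lon_schools_sf <- st_as_sf(london_schools,
                           coords = c("x", "y"),
                           crs = 4326)
```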

13.5 Analysing GCSE exam performance - testing a research hypothesis

To explore the factors that might influence GCSE exam performance in London, we are going to run a series of different regression models. A regression model is simply the expression of a linear relationship between our outcome variable (Average GCSE score in each Ward in London) and another variable or several variables that might explain this outcome.

13.5.1 Research Question and Hypothesis

Examining the spatial distribution of GCSE point scores in the map above, it is clear that there is variation across the city. My research question is:

What are the factors that might lead to variation in Average GCSE point scores across the city?

The research hypothesis I am going to test is that there are other observable factors occurring in Wards in London that might affect the average GCSE scores of students living in those areas.

In inferential statistics, we cannot definitively prove a hypothesis is true, but we can seek to disprove that there is absolutely nothing of interest occurring or no association between variables. The null hypothesis that I am going to test empirically with some models is that there is no relationship between exam scores and other observed variables across London.

13.5.2 Regression Basics

For those of you who know a bit about regression, you might want to skip down to the next section. However, if you are new to regression or would like a refresher, read on…

The linear relationship in a regression model is probably most easily explained using a scatter plot:

Here, I have plotted the average GCSE point score for each Ward in London against another variable in the dataset that I think might be influential: the % of school days lost to unauthorised absences in each ward.
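
A sketch of that plot in ggplot2, using `stat_smooth` to add the regression line (the column names are those that appear in the model output later in this practical; `LonWardProfiles` is the joined object assumed earlier):

```r
library(ggplot2)

# scatter plot of GCSE scores against unauthorised absence, with an OLS line
q <- ggplot(LonWardProfiles,
            aes(x = `Unauthorised Absence in All Schools (%) - 2013`,
                y = `Average GCSE capped point scores - 2014`)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, colour = "blue", size = 1)
q
```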

Remember that my null hypothesis would be that there is no relationship between GCSE scores and unauthorised absence from school. If this null hypothesis was true, then I would not expect to see any pattern in the cloud of points plotted above.

As it is, the scatter plot shows that, generally, as the \(x\) axis independent variable (unauthorised absence) goes up, our \(y\) axis dependent variable (GCSE point score) goes down. This is not a random cloud of points, but a pattern that indicates there could be a relationship here, and so I might be looking to reject my null hypothesis.

Some conventions - in a regression equation, the dependent variable is always labelled \(y\) and shown on the \(y\) axis of a graph; the predictor or independent variable(s) is (are) always shown on the \(x\) axis.

I have added a blue line of best fit: the line that minimises the sum of the squared vertical differences (the residuals) between each point and the line itself. The method used to find this line is known as ‘ordinary least squares’ (OLS): it solves for the intercept and slope that make the sum of the squared residuals as small as possible, leaving the final solution.

As a general rule, the better the blue line is at summarising the relationship between \(y\) and \(x\), the better the model.

The equation for the blue line in the graph above can be written:

(1)\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]

where:

\(\beta_0\) is the intercept (the value of \(y\) when \(x = 0\) - somewhere around 370 on the graph above);

\(\beta_1\) is sometimes referred to as the ‘slope’ parameter and is simply the change in the value of \(y\) for a 1 unit change in the value of \(x\) (the slope of the blue line) - reading the graph above, the change in the value of \(y\) reading between 1.0 and 2.0 on the \(x\) axis looks to be around -40.

\(\epsilon_i\) is a random error term (positive or negative) - essentially, if you add up all of the vertical differences (the residuals) between the blue line and each of the points, they should sum to 0.

Any value of \(y\) along the blue line can be modelled using the corresponding value of \(x\) and these parameter values. Examining the graph above we would expect the average GCSE point score for a student living in a Ward where 0.5% of school days per year were missed, to equal around 350, but we can confirm this by plugging the \(\beta\) parameter values and the value of \(x\) into equation (1):
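
This check can be done directly in R, using the approximate values read from the graph:

```r
# y = beta_0 + beta_1 * x, using the approximate graph-read values
B0 <- 370   # intercept read from the graph
B1 <- -40   # slope read from the graph
x  <- 0.5   # 0.5% of school days missed per year

B0 + B1 * x
```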

## [1] 350

13.5.3 Running a Regression Model in R

In the graph above, I used a method called ‘lm’ in the stat_smooth function in ggplot2 to draw the regression line. ‘lm’ stands for ‘linear model’ and is a standard function in R for running linear regression models. Use the help system to find out more about lm - ?lm

Below is the code that could be used to draw the blue line in our scatter plot. Note that the tilde (~) symbol means “is modelled by”.
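
A sketch of that code (the variable names come from the model call in the output below; `model1` is an assumed object name):

```r
# run the linear regression model and store its outputs in an object called model1
model1 <- lm(`Average GCSE capped point scores - 2014` ~
               `Unauthorised Absence in All Schools (%) - 2013`,
             data = LonWardProfiles)

# show the summary of those outputs
summary(model1)
```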

## 
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`, 
##     data = LonWardProfiles)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.753 -10.223  -1.063   8.547  61.842 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                       371.471      2.165   171.6
## `Unauthorised Absence in All Schools (%) - 2013`  -41.237      1.927   -21.4
##                                                  Pr(>|t|)    
## (Intercept)                                        <2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013`   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.39 on 624 degrees of freedom
## Multiple R-squared:  0.4233, Adjusted R-squared:  0.4224 
## F-statistic:   458 on 1 and 624 DF,  p-value: < 2.2e-16

13.5.3.1 Interpreting and using the model outputs

In running a regression model, we are effectively trying to test (disprove) our null hypothesis. If our null hypothesis were true, then we would expect our coefficients to equal 0.

In the output summary of the model above, there are a number of features you should pay attention to:

Coefficient Estimates - these are the \(\beta_0\) (intercept) and \(\beta_1\) (slope) parameter estimates from Equation 1. You will notice that at \(\beta_0 = 371.471\) and \(\beta_1 = -41.237\) they are pretty close to the estimates of 370 and -40 that we read from the graph earlier, but more precise.

Coefficient Standard Errors - these indicate how much, on average, the coefficient estimate would be expected to vary if the model were fitted to repeated samples (they are the standard deviations of the coefficient estimates). So, while the model says that for a 1% increase in unauthorised absence from school we might expect GCSE scores to drop by 41.2 points, this estimate could vary, on average, by about 1.9 points. As a rule of thumb, we are looking for a standard error that is small relative to the size of the coefficient.

Coefficient t-value - this is the value of the coefficient divided by its standard error and so can be thought of as a kind of standardised coefficient value. The larger the value (either positive or negative), the greater the relative effect that particular independent variable has on the dependent variable (this is perhaps more useful when we have several independent variables in the model).

Coefficient p-value - Pr(>|t|) - the p-value is a measure of significance. There is lots of debate about p-values which I won’t go into here, but essentially it refers to the probability of getting a coefficient as large as the one observed in a set of random data. p-values can be thought of as percentages, so if we have a p-value of 0.05, then there is a 5% chance that a coefficient this large could have arisen in random data, or put another way, a 95% chance that it did not arise by chance.

As a rule of thumb, the smaller the p-value, the more significant that variable is in the story and the smaller the chance that the relationship being observed is just random. Generally, statisticians use 5% or 0.05 as the acceptable cut-off for statistical significance - anything greater than that we should be a little sceptical about.

In R the codes ***, **, *, . are used to indicate significance. We generally want at least a single * next to our coefficient for it to be worth considering.

R-Squared - this can be thought of as an indication of how good your model is: a measure of ‘goodness-of-fit’ (of which there are a number of others). \(r^2\) is quite an intuitive measure of fit as it ranges between 0 and 1 and can be thought of as the % of variation in the dependent variable (in our case GCSE score) explained by variation in the independent variable(s). In our example, an \(r^2\) value of 0.42 indicates that around 42% of the variation in GCSE scores can be explained by variation in unauthorised absence from school - in other words, this is quite a good model. The \(r^2\) value will increase as more independent explanatory variables are added into the model, so where this might be an issue, the adjusted r-squared value can be used to account for this effect.

13.5.3.2 broom

The output from the linear regression model is messy, and like all messy things in R it can be tidied - in this case with a broom! Or rather, with the broom package, which is also part of the tidymodels suite.

Here, let’s load broom and tidy our output. You will need to install either tidymodels or broom first.
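
A sketch, assuming the fitted model was stored as `model1` earlier:

```r
library(broom)

# tidy() turns the coefficient table into a one-row-per-term tibble
tidy(model1)
```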

## # A tibble: 2 x 5
##   term                                     estimate std.error statistic  p.value
##   <chr>                                       <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                                 371.       2.16     172.  0.      
## 2 `Unauthorised Absence in All Schools (%~    -41.2      1.93     -21.4 1.27e-76

We can also use glance() from broom to get a bit more information, such as \(r^2\) and the adjusted r-squared value.
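
Again assuming the model object is called `model1`:

```r
# glance() returns one-row model-level statistics, including r-squared and AIC
glance(model1)
```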

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.423         0.422  16.4      458. 1.27e-76     2 -2638. 5282. 5296.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>